PCA
Ancestry Informative Markers (AIMs) are polymorphisms whose allele frequencies differ detectably between human populations. Several studies have shown that thousands of SNPs vary in frequency across continental populations [1]. These studies have also narrowed the set of markers needed down to only a few hundred. This large-scale work in population genetics and genomics has laid the groundwork for population stratification adjustments in GWAS as well as for admixture mapping. This article focuses on one particular analysis used to predict ancestry from AIMs: Principal Component Analysis, or PCA.

PCA

Principal component analysis is a multivariate statistical method that finds uncorrelated linear combinations of the original variables which, taken together, explain all the variation seen in the data as a whole.

An example: Imagine we need to come up with a formula that summarizes students' performance based on 3 exam scores: g, c and a. An obvious solution would be to take the average of the 3 exam scores: 1/3*g + 1/3*c + 1/3*a. In this case, the set of coefficients in front of the scores, s=(1/3,1/3,1/3), is the vector that defines the linear combination. PCA finds such sets of standardized, mutually orthogonal coefficients which together explain all the variance in the original data. Every variable (i.e. exam) has its own coefficient in each set. So let's assume we apply this simple averaging formula to the exam scores of a number of students and find that the variance of the averages is some value X. How can we explain this variance? Some of it will be explained by the g scores alone, another portion by the c scores alone and another portion by the a scores.
When these portions of "variance explained" are added together, they give the total observed variance in our data. One fundamental feature of PCA is that the first few components explain most of the variance, while the remaining components explain little. So it is fairly common to come across GWAS where only the first 10 principal components are used, since these are generally taken to capture most of the population structure in the genotype data.

PCA and AIMs

As in the example above, PCA generates as many principal components as we have variables (in the case of AIMs, the variables are the genotypes at each locus). Converting alleles from genotype form to numbers (e.g. from A:G to x:y, where x and y are both numbers) is a very important step. One common choice for such values (also known as dummy variables) is 0 (A:A), 1 (A:G) and 2 (G:G) (Figure 1). In this article we chose instead to represent each nucleotide with a unique number, so A:G becomes 1:3 and we end up with two numbers representing each AIM. The values of these numbers do not matter as long as they are unique.

Applying PCA to 23andMe genotypes

The paper by Kosoy et al. [1] identified 128 AIMs which explain genetic variation across continental populations almost as well as hundreds of thousands of markers do (see Slide 1). Of these 128 SNPs, about 70 were available in the raw genotype data from 23andMe. After extracting the AIMs of interest and converting all 70 of them into numerical values, I turned to the 1000 Genomes database, which has full SNP data on 1092 individuals from 14 different populations. After applying the same manipulations to these data as described for 23andMe, I used the R statistical programming language to run PCA on the data. Some visualizations of the results are shown in several images in the slide folder of this article.
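The encoding and PCA steps described above can be illustrated with a small sketch. This is not the code behind the article's analysis, which was run in R with a per-nucleotide coding. It is a minimal Python illustration that assumes the 0/1/2 allele-count coding mentioned earlier and uses six made-up individuals at three made-up loci. The function names, the genotypes and the choice of counted alleles are all hypothetical, and plain power iteration stands in for a library PCA routine, so only the first principal component is computed.

```python
import math

def encode_genotype(genotype, alt_allele):
    """Allele-count coding: number of copies of alt_allele in e.g. 'A:G'."""
    return sum(1 for allele in genotype.split(":") if allele == alt_allele)

def center_columns(rows):
    """Subtract each column's mean so every locus has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix between the columns of centered data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def leading_component(cov, iterations=200):
    """Power iteration for the leading eigenvector/eigenvalue of cov.

    Assumes the start vector is not orthogonal to the leading eigenvector
    and that cov is not all zeros; both hold for the toy data below.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' * cov * v gives the eigenvalue for unit-length v.
    eigenvalue = sum(vi * sum(c * x for c, x in zip(row, v))
                     for vi, row in zip(v, cov))
    return v, eigenvalue

# Hypothetical toy data: allele counted at each of three invented loci.
alt_alleles = ["G", "T", "C"]
# First three individuals mostly carry the other allele, last three mostly
# carry the counted allele, mimicking two populations.
genotypes = [
    ["A:A", "C:C", "T:T"],
    ["A:G", "C:C", "T:T"],
    ["A:A", "C:T", "T:T"],
    ["G:G", "T:T", "C:C"],
    ["A:G", "T:T", "C:C"],
    ["G:G", "C:T", "T:C"],
]

encoded = [[encode_genotype(g, alt) for g, alt in zip(row, alt_alleles)]
           for row in genotypes]
centered = center_columns(encoded)
cov = covariance(centered)
pc1, eigenvalue = leading_component(cov)

# Score of each individual on the first principal component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
# Share of total variance (trace of cov) carried by the first component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

print("PC1 loadings:", [round(w, 3) for w in pc1])
print("PC1 scores:", [round(s, 3) for s in scores])
print("Fraction of variance explained by PC1:", round(explained, 3))
```

In this toy data the two groups of individuals fall on opposite sides of zero along the first component, which is the kind of separation the ancestry plots rely on. With real data one would normally use an established routine (for example `prcomp` in R) rather than hand-written power iteration.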
References

1. Kosoy, R. et al. Hum Mutat. 2009 January; 30(1): 69–78.
2. Crawley, M. J. The R Book, 2007 edition.
3. Price, A. L. et al. Nature Genetics 2006; 38: 904–909.